tors. The discriminators try to distinguish the “real” from the “fake,” and the generator
tries to make the discriminators unable to work well. The result is a rectified process and a
unique architecture with a more precise estimation of the full precision model. Pruning is also explored within the GAN framework to improve the applicability of the 1-bit model in practical applications. To accomplish this, we integrate quantization and pruning into a unified framework.
3.6.1 Loss Function
The rectification process combines full precision kernels and feature maps to rectify the binarization process. It includes kernel approximation and adversarial learning. This learnable kernel approximation leads to a unique architecture with a precise estimation of the convolutional filters by minimizing the kernel loss. Discriminators $D(\cdot)$ with filters $Y$ are introduced to distinguish feature maps $R$ of the full precision model from those $T$ of RBCN. The RBCN generator with filters $W$ and matrices $C$ is trained with $Y$ using knowledge of the supervised feature maps $R$. In summary, $W$, $C$ and $Y$ are learned by solving the following optimization problem:
\[
\arg\min_{W,\hat{W},C}\,\max_{Y}\ \mathcal{L} = L_{Adv}(W,\hat{W},C,Y) + L_{S}(W,\hat{W},C) + L_{Kernel}(W,\hat{W},C),
\tag{3.62}
\]
where $L_{Adv}(W,\hat{W},C,Y)$ is the adversarial loss, defined as
\[
L_{Adv}(W,\hat{W},C,Y) = \log(D(R;Y)) + \log(1 - D(T;Y)),
\tag{3.63}
\]
where $D(\cdot)$ consists of a series of basic blocks, each containing linear and LeakyReLU layers. We also use multiple discriminators to rectify the binarization training process.
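As a concrete illustration, a discriminator of this form might be sketched as below. This is a minimal PyTorch sketch rather than the authors' implementation; the layer widths, the number of blocks, and the final sigmoid score are our assumptions.

```python
# Minimal sketch of a feature-map discriminator D(.; Y) built from basic blocks
# of Linear + LeakyReLU layers (widths, depth, and sigmoid output are assumptions).
import torch
import torch.nn as nn

class FeatureMapDiscriminator(nn.Module):
    def __init__(self, in_features, hidden=256, num_blocks=2):
        super().__init__()
        layers, dim = [], in_features
        for _ in range(num_blocks):
            layers += [nn.Linear(dim, hidden), nn.LeakyReLU(0.2)]
            dim = hidden
        layers += [nn.Linear(dim, 1), nn.Sigmoid()]  # probability that the input is "real"
        self.net = nn.Sequential(*layers)

    def forward(self, feature_map):
        # Flatten an (N, C, H, W) feature map before the linear blocks
        return self.net(feature_map.flatten(start_dim=1))
```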
In addition, $L_{Kernel}(W,\hat{W},C)$ denotes the kernel loss between the learned full precision filters $W$ and the binarized filters $\hat{W}$ and is defined as:
\[
L_{Kernel}(W,\hat{W},C) = \frac{\lambda_1}{2}\,\|W - C\hat{W}\|^2,
\tag{3.64}
\]
where $\lambda_1$ is a balance parameter. Finally, $L_S$ is a traditional problem-dependent loss, such as the softmax loss. The adversarial, kernel, and softmax losses act as regularizations on $\mathcal{L}$.
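Under the common assumption that $\hat{W} = \mathrm{sign}(W)$ and that $C$ acts as a learnable per-filter scale broadcast over each kernel, the kernel loss of Eq. 3.64 could be computed roughly as follows (a sketch, not the reference implementation):

```python
# Sketch of the kernel loss in Eq. 3.64: (lambda_1 / 2) * ||W - C * W_hat||^2.
# Assumes W_hat = sign(W) and a per-filter scale C broadcast over each kernel.
import torch

def kernel_loss(W, C, lambda1=1e-4):
    # W: full-precision filters, e.g. shape (out_c, in_c, k, k)
    # C: learnable scale, e.g. shape (out_c, 1, 1, 1)
    W_hat = torch.sign(W)                      # binarized filters
    return 0.5 * lambda1 * ((W - C * W_hat) ** 2).sum()
```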
For simplicity, the update of the discriminators is omitted in the following description until Algorithm 13. We also omit $\log(\cdot)$ and rewrite the optimization in Eq. 3.62 as Eq. 3.65:
\[
\min_{W,\hat{W},C}\ L_{S}(W,\hat{W},C) + \frac{\lambda_1}{2}\sum_{l}\sum_{i}\|W_i^l - C^l\hat{W}_i^l\|^2 + \sum_{l}\sum_{i}\|1 - D(T_i^l;Y)\|^2,
\tag{3.65}
\]
where $i$ represents the $i$th channel and $l$ the $l$th layer. In Eq. 3.65, the objective is to obtain $W$, $\hat{W}$ and $C$ with $Y$ fixed, which is why the term $D(R;Y)$ of Eq. 3.63 can be ignored. The update process for $Y$ is found in Algorithm 13. The advantage of our formulation in Eq. 3.65 is that the loss function is trainable, which means it can easily be incorporated into existing learning frameworks.
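A minimal sketch of this objective with $Y$ fixed is given below; the names (`task_loss`, the per-layer lists, `lambda1`) are illustrative assumptions, and the per-channel sums of Eq. 3.65 are folded into tensor-wide sums.

```python
# Sketch of the generator-side objective of Eq. 3.65 with the discriminators Y fixed:
# task loss + kernel loss + ||1 - D(T)||^2 rectification term, accumulated over layers.
import torch

def rbcn_generator_loss(task_loss, W_list, C_list, D_list, T_list, lambda1=1e-4):
    loss = task_loss                                                 # L_S(W, W_hat, C)
    for W, C, D, T in zip(W_list, C_list, D_list, T_list):
        W_hat = torch.sign(W)                                        # binarized filters
        loss = loss + 0.5 * lambda1 * ((W - C * W_hat) ** 2).sum()   # kernel term
        loss = loss + ((1.0 - D(T)) ** 2).sum()                      # adversarial term
    return loss
```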
3.6.2 Learning RBCNs
In RBCNs, convolution is implemented using $W^l$, $C^l$ and $F_{in}^l$ to calculate the output feature maps $F_{out}^l$ as
\[
F_{out}^{l} = \mathrm{RBConv}(F_{in}^{l};\hat{W}^{l},C^{l}) = \mathrm{Conv}(F_{in}^{l},\hat{W}^{l}\odot C^{l}),
\tag{3.66}
\]
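A rough sketch of such an RBConv layer is shown below, assuming $\hat{W}^l = \mathrm{sign}(W^l)$ and a channel-wise scale $C^l$, and omitting the straight-through estimator needed to backpropagate through the sign function.

```python
# Sketch of RBConv in Eq. 3.66: an ordinary convolution of F_in with the binarized
# filters W_hat scaled element-wise by C (here a per-filter scale, by assumption).
import torch
import torch.nn.functional as F

def rbconv(F_in, W, C, stride=1, padding=1):
    # F_in: (N, in_c, H, W) input feature maps; W: (out_c, in_c, k, k) filters
    # C: (out_c, 1, 1, 1) learnable scale applied to each binarized filter
    W_hat = torch.sign(W)                              # W_hat = sign(W)
    return F.conv2d(F_in, W_hat * C, stride=stride, padding=padding)
```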